Abstract

The ratios of codon usage in the quartets and sextets for the vertebrate series exhibit
a correlated behaviour which fits naturally in the framework of the crystal basis model of
the genetic code [1]. Moreover, the observed universal behaviour of these suitably normalized
ratios can be easily explained.



LAPTH-709/98 
physics/9812041 
December 1998 



1 Introduction 



It is a well known and intriguing fact that, in the genetic code, 64 codons code for the biosynthesis
of 20 amino-acids (a.a.), with the multiplet structure reported in Table 1.

It is also a well known and, to our knowledge, unexplained fact that the frequency of usage
(codon usage) of the different codons inside a multiplet is not the same.

It is the aim of this paper to emphasize, for the vertebrate series, a correlation in the codon usage
in the quartets and sextets, which is naturally explained in the framework of the mathematical
model of the genetic code recently proposed by the authors [1]. Moreover, we exhibit a
universal behaviour connected with the codon usage, which also finds a justification in
the model.

In Sec. 2 we recall the essential features of the model and in Sec. 3 we present the analysis of the
codon usage for a set of biological sequences in the vertebrate series [2].

2 The crystal basis model 
2.1 The crystal basis 

Let us briefly recall some properties of the crystal basis [3]. We limit ourselves to the case of
$U_q(sl(2))$, but such a basis exists for any finite dimensional representation of $U_{q \to 0}(\mathcal{G})$, $\mathcal{G}$ being any
(semi)-simple classical Lie algebra. The crystal basis has the nice property, for $U_q(sl(2))$, that in
the limit $q \to 0$

$$\tilde{J}_+ u_k = u_{k+1} \quad \text{for } 0 \le k < 2J \tag{1}$$
$$\tilde{J}_- u_k = u_{k-1} \quad \text{for } 0 < k \le 2J \tag{2}$$
$$\tilde{J}_3 u_k = (k - J)\, u_k \quad \text{for } 0 \le k \le 2J \tag{3}$$

and

$$\tilde{J}_+ u_{2J} = \tilde{J}_- u_0 = 0 \tag{4}$$

where the operators $\tilde{J}_\pm$ are a redefinition, using an element of the center, of the generators $J_\pm$ of
$U_q(sl(2))$, $\tilde{J}_3 = J_3$, and the $u_k$ are the basis vectors of the irreducible representation labelled by $J$ ($J$
being an integer or half-integer). The labels of the irreducible representation are connected to the
eigenvalues of the "Casimir" operator $C$:

$$C = (\tilde{J}_3)^2 + \frac{1}{2} \sum_{n \in \mathbb{Z}_+} \sum_{k=0}^{n} (\tilde{J}_-)^{n-k} (\tilde{J}_+)^{n} (\tilde{J}_-)^{k} \tag{5}$$

Its eigenvalue on any basis vector of the irreducible $J$-representation is $J(J + 1)$.
Moreover, any state in the tensor product of two irreducible representations $R_1 \otimes R_2$ is written in
the crystal basis as one and only one tensor product of an $R_1$ state by an $R_2$ state [3]. For example,
taking for $R_1$ and $R_2$ the two-dimensional representation $J = 1/2$ of $U_{q \to 0}(sl(2))$ with states $|+\rangle$ and
$|-\rangle$, one gets in $R_1 \otimes R_2$ the $J = 1$ representation spanned by $|+,+\rangle$, $|-,+\rangle$ and $|-,-\rangle$, and
the $J = 0$ representation with the state $|+,-\rangle$.
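
The two-factor example above can be reproduced with the combinatorial bracketing (signature) rule for $sl(2)$ crystals. The following is a minimal sketch, not the construction used in the model paper: it assumes that states are written as strings of `+` and `-`, that each `-` cancels against the nearest uncancelled `+` to its left, that $2J$ is the number of uncancelled symbols, and that the lowering operator flips the leftmost uncancelled `+`. The function names `scan`, `spin` and `lower` are ours.

```python
from itertools import product

def scan(word):
    """Cancel each '-' against the nearest uncancelled '+' to its left.

    Returns (positions of the uncancelled '+', number of uncancelled '-').
    """
    plus_stack, minus = [], 0
    for i, s in enumerate(word):
        if s == '+':
            plus_stack.append(i)
        elif plus_stack:
            plus_stack.pop()      # this '-' pairs with the last open '+'
        else:
            minus += 1            # no '+' available: the '-' stays uncancelled
    return plus_stack, minus

def spin(word):
    """Label J of the irreducible representation containing `word`."""
    plus_stack, minus = scan(word)
    return (len(plus_stack) + minus) / 2

def lower(word):
    """Crystal lowering operator; returns None when the state is annihilated."""
    plus_stack, _ = scan(word)
    if not plus_stack:
        return None
    i = plus_stack[0]             # leftmost uncancelled '+'
    return word[:i] + '-' + word[i + 1:]

# List J and the lowered state for the four states of R1 (x) R2.
for w in (''.join(p) for p in product('+-', repeat=2)):
    print(w, spin(w), lower(w))
```

With this convention the chain `++`, `-+`, `--` carries $J = 1$ and `+-` is the $J = 0$ state, in agreement with the example in the text.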

Now let us state the main hypotheses of our model [1].
2.2 The assumptions



Assumption I - The four nucleotides containing the bases adenine (A) and guanine (G), deriving
from purine, and cytosine (C) and thymine (T), coming from pyrimidine, are the basis vectors of
a crystal basis of the (1/2, 1/2) irreducible representation of the quantum algebra $U_q(sl(2) \oplus sl(2))$
in the limit $q \to 0$. In the following, we denote by $\pm$ the basis vector corresponding to the
eigenvalue $\pm 1/2$ of $J_{\alpha,3}$, where $\alpha = H$ $(V)$ specifies the generator of the first (second) $sl(2)$. We
assume the following "spin" structure (we recall that the thymine T in DNA is replaced by
uracil U in RNA):

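$$
\begin{array}{ccc}
 & sl(2)_H & \\
C \equiv (+,+) & \longleftrightarrow & U \equiv (-,+) \\
sl(2)_V \updownarrow & & \updownarrow sl(2)_V \\
G \equiv (+,-) & \longleftrightarrow & A \equiv (-,-) \\
 & sl(2)_H &
\end{array}
\tag{6}
$$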

Let us remark that the H-symmetry is associated to the purine-pyrimidine structure, while the
V-symmetry reflects the complementarity rule (that is, the A-T/U and C-G interactions).

Assumption II - The codons are the basis vectors, in the crystal basis, of the irreducible representations
built up by the tensor product of three 4-dimensional (1/2, 1/2) fundamental representations
describing the nucleotides.

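We have reported in Table 1 the assignment of the codons classified in the representations which
appear in the r.h.s. of the following relation:

$$(\tfrac{1}{2},\tfrac{1}{2}) \otimes (\tfrac{1}{2},\tfrac{1}{2}) \otimes (\tfrac{1}{2},\tfrac{1}{2}) = (\tfrac{3}{2},\tfrac{3}{2}) \oplus 2\,(\tfrac{3}{2},\tfrac{1}{2}) \oplus 2\,(\tfrac{1}{2},\tfrac{3}{2}) \oplus 4\,(\tfrac{1}{2},\tfrac{1}{2}) \tag{7}$$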
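
The dimension count behind this decomposition can be checked by enumerating the 64 codons. The sketch below is illustrative only: it takes the sign assignment of Assumption I, and uses the bracketing rule for $sl(2)$ crystals (each `-` cancels the nearest uncancelled `+` to its left, and $2J$ is the number of uncancelled symbols) as a stand-in for the crystal tensor product. It computes only the pair $(J_H, J_V)$, not the upper label that separates copies of the same representation; the names `SIGNS`, `spin` and `codon_spins` are ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption I: (H, V) signs carried by each nucleotide.
SIGNS = {'C': ('+', '+'), 'U': ('-', '+'), 'G': ('+', '-'), 'A': ('-', '-')}

def spin(word):
    """J for a string of '+'/'-' signs, from the bracketing rule."""
    plus = minus = 0
    for s in word:
        if s == '+':
            plus += 1
        elif plus:
            plus -= 1          # '-' cancels the nearest open '+'
        else:
            minus += 1
    return Fraction(plus + minus, 2)

def codon_spins(codon):
    """(J_H, J_V) of a three-letter codon written with C, U, G, A."""
    h = ''.join(SIGNS[n][0] for n in codon)
    v = ''.join(SIGNS[n][1] for n in codon)
    return spin(h), spin(v)

# Number of codons carrying each (J_H, J_V) pair.
counts = Counter(codon_spins(''.join(c)) for c in product('CUGA', repeat=3))
for (jh, jv), n in sorted(counts.items(), reverse=True):
    print(f"({jh}, {jv}): {n} codons")
```

Each of the four $(J_H, J_V)$ values is carried by 16 codons, which matches the right-hand side of (7): one 16-dimensional, two plus two 8-dimensional and four 4-dimensional representations.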

In [1], an operator $\mathcal{R}$ (called the reading or ribosome operator) has been constructed out of
the algebra $U_{q \to 0}(sl(2) \oplus sl(2))$, which describes the multiplet structure of the genetic code in
the following way: two codons have the same eigenvalue under $\mathcal{R}$ if and only if they are associated
to the same amino-acid. Moreover, a "Hamiltonian" depending on 4 parameters has been built
which gives a very satisfactory fit of the 16 values of the free energy released in the folding of
an RNA sequence into a base-paired double helix.

Let us close this section by drawing the reader's attention to Fig. 1, which specifies for each
codon its position in the appropriate representation. The diagram of states for each representation
is supposed to lie in a separate parallel plane. Thick lines connect codons associated to the
same amino-acid. One remarks that each segment relates a pair of codons belonging to the
same representation or to two different representations. This last case occurs for quartets or
sextets of codons associated to the same amino-acid. It is the purpose of this letter to show a
remarkable relation between such multiplets of codons (or amino-acids) involving the same subset
of representations and (branching ratios of) the probabilities of presence of codons in the amino-acid
biosynthesis.
3 Correlation of codon usage



We define the codon usage as the frequency of use of a given codon in the process of biosynthesis
of all the amino-acids. We define the probability of usage of the codon XYZ of a given amino-acid
as the ratio between the occurrence of the codon XYZ and the occurrence N of the corresponding
amino-acid, i.e. as the relative codon frequency, in the limit of very large N. Here and in the
following the labels X, Y, Z, V represent the bases C, U, G, A. The frequency of usage of a
codon in a multiplet is connected to its probability of usage $P(XYZ \to a.a.)$. It is reasonable
to assume that $P(XYZ \to a.a.)$ depends on:

- the biological organism (b.o.) from which the sequence considered has been extracted

- the sequence analyzed

- the nature of the neighboring codons in the sequence

- the amino-acid (a.a.)

- the nature and structure of the multiplet associated to the amino-acid

- the biological environment

- the properties of the codon itself (XYZ).

We neglect the time at which the biosynthesis process takes place, as we assume that the
biosynthesis processes are considered at the same time, at least compared to the time scale of
evolution of the genetic code. We define the branching ratio $B_{ZV}$ as

$$B_{ZV} = \frac{P(XYZ \to a.a.)}{P(XYV \to a.a.)} \tag{8}$$

We argue that in the limit of a very large number of codons, for a fixed biological organism and
amino-acid, the branching ratio depends essentially on the properties of the codon. In our model
this means that in this limit $B_{ZV}$ is a function, depending on the type of the multiplet, of the
quantum numbers of the codons XYZ and XYV, i.e. of the labels $J_\alpha$, $J_{\alpha,3}$, where $\alpha = H$ or $V$, and
of another set of quantum labels lifting the degeneracy on $J_\alpha$; in Table 1 different irreducible
representations with the same values of $J_\alpha$ are distinguished by an upper label. Moreover we
assume that $B_{ZV}$, in the limit specified above, depends only on the irreducible representation (IR)
of the codons, i.e.:

$$B_{ZV} = F_{ZV}(b.o.;\, IR(XYZ);\, IR(XYV)) \tag{9}$$

Let us point out that the branching ratio has a meaning only if the codons XYZ and XYV are
in the same multiplet, i.e. if they code for the same amino-acid.

We consider the quartets and sextets. There are five quartets and three sextets in the eukaryotic
code, which will allow a rather detailed analysis. Moreover the 3 sextets appear as the sum of a
quartet and a doublet, see Table 1. In the following we consider only the quartet sub-part of the
sextets. We recall that the 5 amino-acids coded by the quartets are [Pro, Ala, Thr, Gly, Val]
and the 3 amino-acids coded by the sextets are [Leu, Arg, Ser]. There are, for the quartets,
6 branching ratios, of which only 3 are independent. We choose as fundamental ones the ratios
$B_{AG}$, $B_{CG}$ and $B_{UG}$. We can define several functions $B_{ZV}$ by considering ratios of
probabilities of codons differing in the first two nucleotides XY, i.e.

$$B'_{ZV} = F_{ZV}(b.o.;\, IR(X'Y'Z);\, IR(X'Y'V)) \tag{10}$$

Then if the codons XYZ (XYV) and X'Y'Z (X'Y'V) are respectively in the same irreducible
representation, it follows that

$$B_{ZV} = B'_{ZV} \tag{11}$$
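
To make the definitions concrete, the sketch below computes relative codon frequencies and branching ratios for one quartet. It is an illustration under stated assumptions: the codon counts are hypothetical numbers chosen for the example (they are not taken from the data analyzed here), and the helper names `usage_probabilities` and `branching_ratio` are ours.

```python
def usage_probabilities(counts):
    """Relative frequency of each synonymous codon: count / total for the amino-acid."""
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def branching_ratio(probs, xy, z, v):
    """B_ZV = P(XYZ -> a.a.) / P(XYV -> a.a.) for two codons sharing the prefix XY."""
    return probs[xy + z] / probs[xy + v]

# Hypothetical counts for the Pro quartet (illustrative only).
pro_counts = {'CCC': 400, 'CCU': 350, 'CCG': 100, 'CCA': 150}
probs = usage_probabilities(pro_counts)

# The three ratios chosen as fundamental: B_AG, B_UG, B_CG.
ratios = {z + 'G': branching_ratio(probs, 'CC', z, 'G') for z in 'AUC'}
print(ratios)

# Any other ratio follows from these three, e.g. B_AU = B_AG / B_UG.
b_au = branching_ratio(probs, 'CC', 'A', 'U')
print(b_au, ratios['AG'] / ratios['UG'])
```

The last two printed numbers coincide, which is the sense in which only 3 of the 6 ratios of a quartet are independent.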

The analysis was performed on a set of data retrieved from the data bank "Codon usage
tabulated from GenBank" [2]. In particular we analyzed two different data sets: the first one
comprises all the entries with at least 2 000 codons, while the second one comprises all the entries with
at least 30 000 codons. The reference organism for the analysis was Homo sapiens, whose codon
usage table derives from the analysis of more than 12 500 coding sequences, and corresponds to
about 6 000 000 codons.

Three quartets, coding the amino-acids Pro, Ala and Thr, have exactly the same content in
irreducible representations, see Table 1. In Table 2 we report the 16 biological organisms with
the highest statistics. In Figs. 2, 3 and 4 the branching ratios $B_{AG}$, $B_{CG}$ and $B_{UG}$ are reported for the 8 amino-acids
coded by the quartets and sextets, showing:

• a clear correlation between the four amino-acids Pro, Ala, Thr and Ser. From Table 1 we
see that for these amino-acids the irreducible representation involved in the numerator of
the branching ratios (see (8)) is always the same: $(1/2, 1/2)^1$ for $B_{AG}$, $(1/2, 3/2)^1$ for $B_{UG}$,
$(3/2, 3/2)$ for $B_{CG}$, while the irreducible representation in the denominator is $(3/2, 1/2)^1$ for
the whole set. The relative position of each of these quartets of codons can be more easily
visualized in Fig. 1, where Pro, Ala, Thr and Ser (quartet part) constitute the four edges of
a vertical column linking the representation $(1/2, 1/2)^1$, sitting at the ground floor, first to
the representation $(3/2, 1/2)^1$, then to the $(1/2, 3/2)^1$ one and finally to the representation
$(3/2, 3/2)$, this last one located at the top floor.

• a clear correlation between the two amino-acids Val and Leu. From Table 1 we see that
also for these two amino-acids the irreducible representation in the numerator of (8) is the
same: $(1/2, 1/2)^3$ for $B_{AG}$, $(1/2, 3/2)^2$ for $B_{UG}$, $(1/2, 3/2)^2$ for $B_{CG}$, and the irreducible
representation in the denominator is $(1/2, 1/2)^3$. Considering Fig. 1, it is now the two
representations $(1/2, 1/2)^3$ and $(1/2, 3/2)^2$ which are brought together, the codons associated
to Val and Leu (quartet part) determining the vertices of two parallel and vertical plaquettes.

• no correlation of Arg, nor of Gly, with the other amino-acids, in agreement
with the irreducible representation assignment of Table 1. Indeed we can note in Fig. 1
that the representations $(1/2, 1/2)^2$ and $(3/2, 1/2)^2$ are connected by the codon quartet
relative to Arg and only by this multiplet. We also remark that the Gly quartet lies in the
representations $(1/2, 3/2)^1$ and $(3/2, 3/2)$: its position is completely different from that of the
quartets discussed above which show up in these representations.

Then in Figs. 5, 6 and 7 we have drawn the normalized branching ratios $\hat{B}_{PG}$, $P \in \{A, U, C\}$,
defined by:

$$\hat{B}_{PG} = \frac{B_{PG}}{\sum_{a.a.} B_{PG}} \tag{12}$$

where the sum $\sum_{a.a.}$ is extended to the eight amino-acids listed above. The mean value and the
standard deviation are:





                               Pro    Ala    Thr    Ser    Val    Leu    Arg    Gly

$\langle\hat{B}_{AG}\rangle$   1.60   1.46   1.57   1.61   0.16   0.11   0.50   1.00
$\sigma(\hat{B}_{AG})$         0.16   0.16   0.17   0.21   0.03   0.02   0.15   0.29

$\langle\hat{B}_{CG}\rangle$   1.24   1.66   1.46   1.83   0.26   0.23   0.60   0.73
$\sigma(\hat{B}_{CG})$         0.15   0.18   0.15   0.23   0.05   0.04   0.16   0.18

$\langle\hat{B}_{UG}\rangle$   1.49   1.71   1.24   2.07   0.25   0.19   0.45   0.60
$\sigma(\hat{B}_{UG})$         0.26   0.13   0.14   0.32   0.06   0.04   0.22   0.22
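
As a consistency check on the tabulated means, one can add up each row. The sketch below copies the mean values from the table above; the names `AMINO_ACIDS`, `MEANS` and `row_sums` are ours. Each row adds up to about 8, i.e. the tabulated values average to about 1 over the eight amino-acids. We read this as a sign that the tabulated quantities are scaled to unit mean over the eight amino-acids; this is our inference from the numbers and is not stated in the text.

```python
AMINO_ACIDS = ['Pro', 'Ala', 'Thr', 'Ser', 'Val', 'Leu', 'Arg', 'Gly']

# Mean normalized branching ratios, copied from the table (high-statistics set).
MEANS = {
    'AG': [1.60, 1.46, 1.57, 1.61, 0.16, 0.11, 0.50, 1.00],
    'CG': [1.24, 1.66, 1.46, 1.83, 0.26, 0.23, 0.60, 0.73],
    'UG': [1.49, 1.71, 1.24, 2.07, 0.25, 0.19, 0.45, 0.60],
}

# Sum of the eight means in each row, rounded to the precision of the table.
row_sums = {name: round(sum(values), 2) for name, values in MEANS.items()}
print(row_sums)
```

Because the eight values in a row are tied by a fixed sum, any one of them (Gly in the text) is determined by the other seven.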



These diagrams show a universal behaviour of $\hat{B}_{PG}$, which has the same value independently
of the biological organism. We have omitted in the diagrams the branching ratio of the amino-acid
Gly, as it is determined by the branching ratios of the other amino-acids through the normalization in
eq. (12). In our model this behaviour can easily be understood if the branching ratio $B_{ZV}$ has
the factorized form

$$B_{ZV} = \Phi_{ZV}(b.o.)\, \Psi_{ZV}(IR(XYZ);\, IR(XYV)) \tag{13}$$

The organism-dependent factor $\Phi_{PG}(b.o.)$ then cancels between the numerator and the denominator of eq. (12).
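
The effect of the factorization can be illustrated numerically. In the sketch below every number is hypothetical: `psi` stands for the factor that depends only on the irreducible representations (indexed here by amino-acid), `phi` for the factor that depends only on the organism, and `normalized` divides each ratio by the sum over amino-acids. None of these names or values comes from the data.

```python
def normalized(ratios):
    """Divide each branching ratio by the sum over the amino-acids."""
    total = sum(ratios.values())
    return {aa: b / total for aa, b in ratios.items()}

# Hypothetical representation-dependent factors, one per amino-acid.
psi = {'Pro': 1.6, 'Val': 0.2, 'Arg': 0.5}
# Hypothetical organism-dependent factors.
phi = {'organism_1': 0.7, 'organism_2': 1.9}

# Factorized branching ratios B = phi(b.o.) * psi(IR), then normalized.
universal = {
    org: normalized({aa: f * p for aa, p in psi.items()})
    for org, f in phi.items()
}
for org, values in universal.items():
    print(org, values)
```

The unnormalized ratios differ between the two organisms by the overall factor `phi`, while the normalized ones agree up to rounding: the organism dependence drops out of the normalized quantity.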

This factorization also explains the correlation in the behaviour of the values of $B_{PG}$ for
different biological organisms, see Figs. 2, 3 and 4. Finally we report in the table below the mean
value and the standard deviation for the case of biological organisms with low statistics, to show
the effects of the statistics.





                               Pro    Ala    Thr    Ser    Val    Leu    Arg    Gly

$\langle\hat{B}_{AG}\rangle$   1.77   1.49   1.67   1.13   0.17   0.15   0.61   1.01
$\sigma(\hat{B}_{AG})$         0.67   0.40   0.49   0.56   0.09   0.18   0.34   0.44

$\langle\hat{B}_{CG}\rangle$   1.29   1.55   1.52   1.82   0.25   0.23   0.62   0.71
$\sigma(\hat{B}_{CG})$         0.47   0.39   0.41   0.53   0.08   0.08   0.32   0.32

$\langle\hat{B}_{UG}\rangle$   1.50   1.61   1.26   2.09   0.25   0.19   0.51   0.60
$\sigma(\hat{B}_{UG})$         0.58   0.39   0.39   0.64   0.10   0.09   0.28   0.32



4 Conclusions 



The basic elements of our model of the genetic code are the 4 nucleotides, and the 64 codons come
out as composite states. The symmetry algebra $U_{q \to 0}(sl(2) \oplus sl(2))$ has two main characteristics.
Firstly, it encodes the stereochemical property of a base, conferring quantum numbers on each
nucleotide. Secondly, it admits representation spaces with the remarkable property that the
basis vectors of the tensor product are ordered sequences of the basic elements (nucleotides). The
model does not necessarily assign the codons in a multiplet (in particular the quartets, sextets
and triplet) to the same irreducible representation. This feature is relevant. Indeed, as we have
shown in this paper, it may explain the correlation between the branching ratios of the codon
usage of different codons coding the same amino-acid. Let us remark that the assignment of the
codons to the different irreducible representations is a straightforward consequence of the tensor
product once the nucleotides are assigned to the fundamental irreducible representation, see our first
assumption.
It is a prediction of our model that for any biological organism belonging to the vertebrate
series, in the limit of a large number of biosynthesized amino-acids, the ratios $B_{AG}$, $B_{UG}$ and $B_{CG}$
for, respectively, Pro, Ala, Thr and Ser (Val and Leu) should be very close. Let us remark that
these ratios obviously depend on the biological organism and we are unable to make any prediction
on their values, but only that their values should be correlated. Our analysis has also shown
a universal behaviour of the normalized branching ratios of the codon usage for the vertebrates,
which was not evidently expected in our model, but which can easily be explained assuming a
factorized form for $B_{ZV}$. So, assuming the factorization (13), we foresee that the normalized
ratios $\hat{B}_{AG}$, $\hat{B}_{UG}$ and $\hat{B}_{CG}$ should be given for any biological organism by the values reported in
Figs. 5, 6 and 7.

A first analysis including biological organisms belonging to the invertebrate and plant series
shows that the pattern of correlation is still present, although in a less striking way, and that significant
deviations appear for some biological organisms. A more detailed analysis, extended to the
other multiplets, in particular the doublets, and to other series of biological organisms, will be
presented in a further publication.
Figure 1: Classification of the codons in the different crystal bases.

Figure 3: Branching ratio $B_{CG}$ for the vertebrate series.

Figure 4: Branching ratio $B_{UG}$ for the vertebrate series.

Table 1: The eukaryotic code. The upper label denotes different IR.



codon  a.a.  ($J_H$ $J_V$)      codon  a.a.  ($J_H$ $J_V$)

CCC    Pro   (3/2 3/2)          UCC    Ser   (3/2 3/2)
CCU    Pro   (1/2 3/2)^1        UCU    Ser   (1/2 3/2)^1
CCG    Pro   (3/2 1/2)^1        UCG    Ser   (3/2 1/2)^1
CCA    Pro   (1/2 1/2)^1        UCA    Ser   (1/2 1/2)^1

CUC    Leu   (1/2 3/2)^2        UUC    Phe   (3/2 3/2)
CUU    Leu   (1/2 3/2)^2        UUU    Phe   (3/2 3/2)
CUG    Leu   (1/2 1/2)^3        UUG    Leu   (3/2 1/2)^1
CUA    Leu   (1/2 1/2)^3        UUA    Leu   (3/2 1/2)^1

CGC    Arg   (3/2 1/2)^2        UGC    Cys   (3/2 1/2)^2
CGU    Arg   (1/2 1/2)^2        UGU    Cys   (1/2 1/2)^2
CGG    Arg   (3/2 1/2)^2        UGG    Trp   (3/2 1/2)^2
CGA    Arg   (1/2 1/2)^2        UGA    Ter   (1/2 1/2)^2

CAC    His   (1/2 1/2)^4        UAC    Tyr   (3/2 1/2)^2
CAU    His   (1/2 1/2)^4        UAU    Tyr   (3/2 1/2)^2
CAG    Gln   (1/2 1/2)^4        UAG    Ter   (3/2 1/2)^2
CAA    Gln   (1/2 1/2)^4        UAA    Ter   (3/2 1/2)^2

GCC    Ala   (3/2 3/2)          ACC    Thr   (3/2 3/2)
GCU    Ala   (1/2 3/2)^1        ACU    Thr   (1/2 3/2)^1
GCG    Ala   (3/2 1/2)^1        ACG    Thr   (3/2 1/2)^1
GCA    Ala   (1/2 1/2)^1        ACA    Thr   (1/2 1/2)^1

GUC    Val   (1/2 3/2)^2        AUC    Ile   (3/2 3/2)
GUU    Val   (1/2 3/2)^2        AUU    Ile   (3/2 3/2)
GUG    Val   (1/2 1/2)^3        AUG    Met   (3/2 1/2)^1
GUA    Val   (1/2 1/2)^3        AUA    Ile   (3/2 1/2)^1

GGC    Gly   (3/2 3/2)          AGC    Ser   (3/2 3/2)
GGU    Gly   (1/2 3/2)^1        AGU    Ser   (1/2 3/2)^1
GGG    Gly   (3/2 3/2)          AGG    Arg   (3/2 3/2)
GGA    Gly   (1/2 3/2)^1        AGA    Arg   (1/2 3/2)^1

GAC    Asp   (1/2 3/2)^2        AAC    Asn   (3/2 3/2)
GAU    Asp   (1/2 3/2)^2        AAU    Asn   (3/2 3/2)
GAG    Glu   (1/2 3/2)^2        AAG    Lys   (3/2 3/2)
GAA    Glu   (1/2 3/2)^2        AAA    Lys   (3/2 3/2)



Table 2: Biological organisms with highest statistics. 





      Biological organism      number of sequences   number of codons

 1    Homo sapiens                   12 512              6 130 940
 2    Gallus gallus                   1 319                638 532
 3    Xenopus laevis                  1 144                493 437
 4    Bos taurus                      1 182                478 270
 5    Oryctolagus cuniculus             639                321 129
 6    Sus scrofa                        539                216 654
 7    Danio rerio                       259                 99 766
 8    Canis familiaris                  230                 94 444
 9    Ovis aries                        275                 81 177
10    Oncorhynchus mykiss               128                 42 794
11    Macaca mulatta                    110                 34 510
12    Fugu rubripes                      63                 32 943
13    Cyprinus carpio                    95                 32 365
14    Equus caballus                     94                 31 254
15    Rana catesbeiana                   61                 30 629
16    Felis catus                        83                 30 031